This dataset reports flights in the United States, including carriers, arrival and departure delays, and reasons for delays, from 1987 to 2008. I have limted the review to just 3 years due to the size of the dataset. The different dataset can be obtained from this link; Flight Data
The different variables are described as follow:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import datetime as dt
import time
import sys
import os
import bz2
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
sb.set_style('darkgrid')
# Extracting csv files from the downloaded BZip2 files.
# compressed_file = ['c:\\users\\okpokou\\Documents\\udacity\\Project 3\\Compressed files\\1987.csv.bz2',
# 'c:\\users\\okpokou\\Documents\\udacity\\Project 3\\Compressed files\\1988.csv.bz2',
# 'c:\\users\\okpokou\\Documents\\udacity\\Project 3\\Compressed files\\1989.csv.bz2']
# new_file = ['c:\\users\\okpokou\\Documents\\udacity\\Project 3\\1987.csv',
# 'c:\\users\\okpokou\\Documents\\udacity\\Project 3\\1988.csv',
# 'c:\\users\\okpokou\\Documents\\udacity\\Project 3\\1989.csv']
# for i in range(len(compressed_file)):
# file = new_file[i]
# stream = bz2.BZ2File(compressed_file[i])
# with open(file, 'wb') as f:
# for i in range(1,1000000):
# f.write(stream.readline())
# loading data into appropraite DataFrames.
df_1987 = pd.read_csv('1987.csv')
df_1988 = pd.read_csv('1988.csv')
df_1989 = pd.read_csv('1989.csv')
print('df_1987 shape: ',df_1987.shape)
print('df_1988 shape: ',df_1988.shape)
print('df_1989 shape: ',df_1989.shape)
carriers = pd.read_csv('c:\\users\\okpokou\\Documents\\udacity\\Project 3\\dataverse_files\\carriers.csv')
airports = pd.read_csv('c:\\users\\okpokou\\Documents\\udacity\\Project 3\\dataverse_files\\airports.csv')
plane_data = pd.read_csv('c:\\users\\okpokou\\Documents\\udacity\\Project 3\\dataverse_files\\plane-data.csv')
print('carriers shape: ',carriers.shape)
print('airports shape: ',airports.shape)
print('plane_data shape: ',plane_data.shape)
df_1987 shape: (999998, 29) df_1988 shape: (999998, 29) df_1989 shape: (999998, 29) carriers shape: (1491, 2) airports shape: (3376, 7) plane_data shape: (5029, 9)
# confirming that the columns are the same for the different years.
sum(df_1987.columns == df_1988.columns , df_1989.columns == df_1987.columns)
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True])
df_1987 = df_1987.append(df_1988,ignore_index=True)
flights = df_1987.append(df_1989,ignore_index=True)
flights.shape
(2999994, 29)
flights.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2999994 entries, 0 to 2999993 Data columns (total 29 columns): # Column Dtype --- ------ ----- 0 Year int64 1 Month int64 2 DayofMonth int64 3 DayOfWeek int64 4 DepTime float64 5 CRSDepTime int64 6 ArrTime float64 7 CRSArrTime int64 8 UniqueCarrier object 9 FlightNum int64 10 TailNum float64 11 ActualElapsedTime float64 12 CRSElapsedTime int64 13 AirTime float64 14 ArrDelay float64 15 DepDelay float64 16 Origin object 17 Dest object 18 Distance float64 19 TaxiIn float64 20 TaxiOut float64 21 Cancelled int64 22 CancellationCode float64 23 Diverted int64 24 CarrierDelay float64 25 WeatherDelay float64 26 NASDelay float64 27 SecurityDelay float64 28 LateAircraftDelay float64 dtypes: float64(16), int64(10), object(3) memory usage: 663.8+ MB
flights.isna().sum()
Year 0 Month 0 DayofMonth 0 DayOfWeek 0 DepTime 55424 CRSDepTime 0 ArrTime 65560 CRSArrTime 0 UniqueCarrier 0 FlightNum 0 TailNum 2999994 ActualElapsedTime 65560 CRSElapsedTime 0 AirTime 2999994 ArrDelay 65560 DepDelay 55424 Origin 0 Dest 0 Distance 1985 TaxiIn 2999994 TaxiOut 2999994 Cancelled 0 CancellationCode 2999994 Diverted 0 CarrierDelay 2999994 WeatherDelay 2999994 NASDelay 2999994 SecurityDelay 2999994 LateAircraftDelay 2999994 dtype: int64
flights.head(10)
| Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1987 | 10 | 14 | 3 | 741.0 | 730 | 912.0 | 849 | PS | 1451 | ... | NaN | NaN | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1987 | 10 | 15 | 4 | 729.0 | 730 | 903.0 | 849 | PS | 1451 | ... | NaN | NaN | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 1987 | 10 | 17 | 6 | 741.0 | 730 | 918.0 | 849 | PS | 1451 | ... | NaN | NaN | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 1987 | 10 | 18 | 7 | 729.0 | 730 | 847.0 | 849 | PS | 1451 | ... | NaN | NaN | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 1987 | 10 | 19 | 1 | 749.0 | 730 | 922.0 | 849 | PS | 1451 | ... | NaN | NaN | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 5 | 1987 | 10 | 21 | 3 | 728.0 | 730 | 848.0 | 849 | PS | 1451 | ... | NaN | NaN | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 6 | 1987 | 10 | 22 | 4 | 728.0 | 730 | 852.0 | 849 | PS | 1451 | ... | NaN | NaN | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 7 | 1987 | 10 | 23 | 5 | 731.0 | 730 | 902.0 | 849 | PS | 1451 | ... | NaN | NaN | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 8 | 1987 | 10 | 24 | 6 | 744.0 | 730 | 908.0 | 849 | PS | 1451 | ... | NaN | NaN | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 9 | 1987 | 10 | 25 | 7 | 729.0 | 730 | 851.0 | 849 | PS | 1451 | ... | NaN | NaN | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
10 rows × 29 columns
As observed, we have to extract another dataset as "no flights" were cancelled and the required variables for the analysis (reasons for the delay or cancelled flights) are NaNs.
# Extracting csv files from the downloaded BZip2 files.
# compressed_file = ['c:\\users\\okpokou\\Documents\\udacity\\Project 3\\Compressed files\\2007.csv.bz2',
# 'c:\\users\\okpokou\\Documents\\udacity\\Project 3\\Compressed files\\2008.csv.bz2',
# 'c:\\users\\okpokou\\Documents\\udacity\\Project 3\\Compressed files\\2006.csv.bz2']
# new_file = ['c:\\users\\okpokou\\Documents\\udacity\\Project 3\\2007.csv',
# 'c:\\users\\okpokou\\Documents\\udacity\\Project 3\\2008.csv',
# 'c:\\users\\okpokou\\Documents\\udacity\\Project 3\\2006.csv']
# for i in range(len(compressed_file)):
# file = new_file[i]
# stream = bz2.BZ2File(compressed_file[i])
# with open(file, 'wb') as f:
# for i in range(1,2000000):
# f.write(stream.readline())
# loading data into appropraite DataFrames.
df_06 = pd.read_csv('2006.csv')
df_07 = pd.read_csv('2007.csv')
df_08 = pd.read_csv('2008.csv')
print('df_06 shape: ',df_06.shape)
print('df_07 shape: ',df_07.shape)
print('df_08 shape: ',df_08.shape)
df_06 shape: (1999998, 29) df_07 shape: (1999998, 29) df_08 shape: (1999998, 29)
# confirming that the columns are the same for the different years.
sum(df_06.columns == df_07.columns , df_08.columns == df_07.columns)
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True])
# Append the DataFrames into one Table.
df_06 = df_06.append(df_07,ignore_index=True)
flights = df_06.append(df_08,ignore_index=True)
flights.shape
(5999994, 29)
flights.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5999994 entries, 0 to 5999993 Data columns (total 29 columns): # Column Dtype --- ------ ----- 0 Year int64 1 Month int64 2 DayofMonth int64 3 DayOfWeek int64 4 DepTime float64 5 CRSDepTime int64 6 ArrTime float64 7 CRSArrTime int64 8 UniqueCarrier object 9 FlightNum int64 10 TailNum object 11 ActualElapsedTime float64 12 CRSElapsedTime float64 13 AirTime float64 14 ArrDelay float64 15 DepDelay float64 16 Origin object 17 Dest object 18 Distance int64 19 TaxiIn float64 20 TaxiOut float64 21 Cancelled int64 22 CancellationCode object 23 Diverted int64 24 CarrierDelay float64 25 WeatherDelay float64 26 NASDelay float64 27 SecurityDelay float64 28 LateAircraftDelay float64 dtypes: float64(14), int64(10), object(5) memory usage: 1.3+ GB
flights.isna().sum()
Year 0 Month 0 DayofMonth 0 DayOfWeek 0 DepTime 149052 CRSDepTime 0 ArrTime 162200 CRSArrTime 0 UniqueCarrier 0 FlightNum 0 TailNum 35786 ActualElapsedTime 162200 CRSElapsedTime 451 AirTime 162200 ArrDelay 162200 DepDelay 149052 Origin 0 Dest 0 Distance 0 TaxiIn 61415 TaxiOut 56510 Cancelled 0 CancellationCode 5850942 Diverted 0 CarrierDelay 1500345 WeatherDelay 1500345 NASDelay 1500345 SecurityDelay 1500345 LateAircraftDelay 1500345 dtype: int64
# checking for duplicates.
flights.duplicated().sum()
10
flights[flights.duplicated()]
| Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 752508 | 2006 | 2 | 20 | 1 | 1204.0 | 1157 | 1244.0 | 1240 | OO | 6117 | ... | 3.0 | 9.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 888040 | 2006 | 2 | 19 | 7 | 821.0 | 730 | 948.0 | 850 | FL | 768 | ... | 15.0 | 22.0 | 0 | NaN | 0 | 0.0 | 0.0 | 7.0 | 0.0 | 51.0 |
| 1039373 | 2006 | 2 | 24 | 5 | 913.0 | 919 | NaN | 1054 | AS | 152 | ... | 0.0 | 17.0 | 0 | NaN | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1462456 | 2006 | 3 | 22 | 3 | 835.0 | 800 | 1101.0 | 1040 | FL | 762 | ... | 4.0 | 10.0 | 0 | NaN | 0 | 21.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2354400 | 2007 | 1 | 14 | 7 | 35.0 | 35 | 618.0 | 605 | F9 | 514 | ... | 37.0 | 12.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2356312 | 2007 | 1 | 21 | 7 | 32.0 | 35 | 621.0 | 605 | F9 | 514 | ... | 35.0 | 10.0 | 0 | NaN | 0 | 0.0 | 0.0 | 16.0 | 0.0 | 0.0 |
| 4937923 | 2008 | 2 | 28 | 4 | 829.0 | 835 | 1341.0 | 1350 | F9 | 728 | ... | 6.0 | 14.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4937925 | 2008 | 2 | 28 | 4 | 1427.0 | 1435 | 1640.0 | 1636 | F9 | 727 | ... | 9.0 | 16.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4937927 | 2008 | 2 | 28 | 4 | 1854.0 | 1807 | 1946.0 | 1902 | F9 | 773 | ... | 8.0 | 13.0 | 0 | NaN | 0 | 44.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4937929 | 2008 | 2 | 28 | 4 | 2027.0 | 1942 | 2314.0 | 2229 | F9 | 780 | ... | 10.0 | 13.0 | 0 | NaN | 0 | 1.0 | 0.0 | 44.0 | 0.0 | 0.0 |
10 rows × 29 columns
flights.sample(50)
| Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 717577 | 2006 | 2 | 19 | 7 | 1435.0 | 1440 | 1530.0 | 1540 | OH | 5707 | ... | 4.0 | 12.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4072884 | 2008 | 1 | 25 | 5 | 822.0 | 820 | 934.0 | 925 | WN | 1938 | ... | 4.0 | 11.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3500346 | 2007 | 3 | 7 | 3 | 948.0 | 725 | 1100.0 | 842 | US | 1833 | ... | 11.0 | 10.0 | 0 | NaN | 0 | 0.0 | 0.0 | 138.0 | 0.0 | 0.0 |
| 4807130 | 2008 | 2 | 27 | 3 | 1916.0 | 1920 | 1947.0 | 1959 | OO | 6577 | ... | 4.0 | 12.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 5574449 | 2008 | 3 | 25 | 2 | 1255.0 | 1300 | 1451.0 | 1500 | MQ | 3651 | ... | 4.0 | 16.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1646761 | 2006 | 3 | 29 | 3 | 727.0 | 730 | 1000.0 | 1000 | B6 | 34 | ... | 8.0 | 16.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5397729 | 2008 | 3 | 19 | 3 | 2053.0 | 2045 | 455.0 | 448 | UA | 58 | ... | 9.0 | 12.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 2140549 | 2007 | 1 | 14 | 7 | 2050.0 | 2010 | 2120.0 | 2045 | YV | 2716 | ... | 7.0 | 10.0 | 0 | NaN | 0 | 35.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3934489 | 2007 | 4 | 13 | 5 | 1613.0 | 1410 | 1823.0 | 1615 | XE | 2500 | ... | 8.0 | 27.0 | 0 | NaN | 0 | 0.0 | 0.0 | 128.0 | 0.0 | 0.0 |
| 2934497 | 2007 | 2 | 25 | 7 | 1308.0 | 1210 | 1441.0 | 1339 | EV | 4580 | ... | 18.0 | 7.0 | 0 | NaN | 0 | 0.0 | 0.0 | 4.0 | 0.0 | 58.0 |
| 1789015 | 2006 | 4 | 13 | 4 | 1435.0 | 1435 | 1505.0 | 1500 | WN | 990 | ... | 5.0 | 10.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 576511 | 2006 | 1 | 26 | 4 | 1142.0 | 1150 | 2024.0 | 2014 | DL | 2022 | ... | 20.0 | 20.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3653096 | 2007 | 3 | 16 | 5 | 833.0 | 835 | 1009.0 | 1016 | NW | 952 | ... | 4.0 | 16.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 226656 | 2006 | 1 | 25 | 3 | 837.0 | 840 | 1110.0 | 1120 | XE | 2179 | ... | 11.0 | 23.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4372448 | 2008 | 1 | 28 | 1 | 1054.0 | 1049 | 1200.0 | 1158 | FL | 985 | ... | 6.0 | 9.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 656191 | 2006 | 2 | 16 | 4 | 1550.0 | 1510 | 1659.0 | 1620 | WN | 1794 | ... | 1.0 | 9.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 39.0 |
| 5105770 | 2008 | 2 | 8 | 5 | 1440.0 | 1445 | 2146.0 | 2155 | AQ | 486 | ... | 5.0 | 13.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1827096 | 2006 | 4 | 26 | 3 | 1235.0 | 1240 | 1840.0 | 1905 | WN | 324 | ... | 2.0 | 10.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2554965 | 2007 | 1 | 8 | 1 | 659.0 | 700 | 931.0 | 934 | AS | 464 | ... | 3.0 | 21.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 499382 | 2006 | 1 | 22 | 7 | 2055.0 | 2035 | 2217.0 | 2155 | AS | 498 | ... | 6.0 | 12.0 | 0 | NaN | 0 | 0.0 | 0.0 | 2.0 | 0.0 | 20.0 |
| 4094394 | 2008 | 1 | 25 | 5 | 1120.0 | 1120 | 1222.0 | 1223 | XE | 105 | ... | 6.0 | 10.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 5205854 | 2008 | 3 | 12 | 3 | 1952.0 | 1935 | 2156.0 | 2130 | WN | 274 | ... | 3.0 | 11.0 | 0 | NaN | 0 | 17.0 | 0.0 | 9.0 | 0.0 | 0.0 |
| 2226114 | 2007 | 1 | 26 | 5 | 2119.0 | 2130 | 445.0 | 454 | UA | 68 | ... | 7.0 | 18.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2795991 | 2007 | 2 | 9 | 5 | 828.0 | 830 | 955.0 | 955 | OO | 5706 | ... | 1.0 | 12.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 185704 | 2006 | 1 | 21 | 6 | 640.0 | 635 | 737.0 | 729 | OO | 6516 | ... | 6.0 | 15.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4227708 | 2008 | 1 | 11 | 5 | 758.0 | 800 | 1107.0 | 1111 | UA | 346 | ... | 3.0 | 20.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1879743 | 2006 | 4 | 1 | 6 | 1314.0 | 1323 | 1336.0 | 1350 | OO | 6706 | ... | 8.0 | 6.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5311756 | 2008 | 3 | 7 | 5 | 1831.0 | 1830 | 1915.0 | 1910 | YV | 1029 | ... | 4.0 | 11.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 2687216 | 2007 | 2 | 22 | 4 | 704.0 | 705 | 904.0 | 855 | WN | 1453 | ... | 3.0 | 15.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2120125 | 2007 | 1 | 8 | 1 | NaN | 1435 | NaN | 1631 | XE | 2450 | ... | 0.0 | 0.0 | 1 | C | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1890008 | 2006 | 4 | 8 | 6 | 1722.0 | 1554 | 1831.0 | 1712 | OO | 6507 | ... | 5.0 | 10.0 | 0 | NaN | 0 | 79.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2833812 | 2007 | 2 | 9 | 5 | 1743.0 | 1745 | 1937.0 | 1919 | UA | 259 | ... | 14.0 | 23.0 | 0 | NaN | 0 | 0.0 | 0.0 | 18.0 | 0.0 | 0.0 |
| 3338751 | 2007 | 3 | 22 | 4 | 1031.0 | 1029 | 1251.0 | 1241 | YV | 7450 | ... | 8.0 | 22.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2220451 | 2007 | 1 | 29 | 1 | 529.0 | 530 | 606.0 | 615 | OO | 6106 | ... | 8.0 | 11.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5958431 | 2008 | 4 | 4 | 5 | 1952.0 | 1947 | 2029.0 | 2027 | OO | 5748 | ... | 4.0 | 7.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3943773 | 2007 | 4 | 21 | 6 | 1139.0 | 1140 | 1328.0 | 1340 | XE | 2784 | ... | 7.0 | 6.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1471149 | 2006 | 3 | 14 | 2 | 1713.0 | 1705 | 1757.0 | 1752 | HA | 261 | ... | 5.0 | 4.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4808035 | 2008 | 2 | 28 | 4 | 1834.0 | 1838 | 1925.0 | 1935 | OO | 5518 | ... | 4.0 | 13.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 5313765 | 2008 | 3 | 10 | 1 | 2005.0 | 2005 | 2105.0 | 2101 | YV | 2832 | ... | 7.0 | 27.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 380532 | 2006 | 1 | 7 | 6 | 638.0 | 643 | 720.0 | 739 | MQ | 4402 | ... | 6.0 | 4.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1553469 | 2006 | 3 | 20 | 1 | 904.0 | 908 | 1104.0 | 1112 | NW | 1908 | ... | 6.0 | 18.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2965858 | 2007 | 2 | 24 | 6 | 925.0 | 925 | 1116.0 | 1110 | FL | 775 | ... | 9.0 | 21.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3050093 | 2007 | 2 | 6 | 2 | 600.0 | 605 | 755.0 | 751 | 9E | 3709 | ... | 12.0 | 9.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4373224 | 2008 | 1 | 29 | 2 | 1353.0 | 1400 | 1558.0 | 1606 | FL | 345 | ... | 10.0 | 12.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 459364 | 2006 | 1 | 23 | 1 | 554.0 | 600 | 730.0 | 736 | AA | 1171 | ... | 5.0 | 10.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 666956 | 2006 | 2 | 20 | 1 | 2205.0 | 2115 | 2230.0 | 2135 | WN | 713 | ... | 10.0 | 10.0 | 0 | NaN | 0 | 10.0 | 0.0 | 5.0 | 0.0 | 40.0 |
| 1849808 | 2006 | 4 | 13 | 4 | 1645.0 | 1645 | 1806.0 | 1832 | YV | 7319 | ... | 4.0 | 6.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5567552 | 2008 | 3 | 25 | 2 | 801.0 | 805 | 926.0 | 935 | MQ | 3355 | ... | 2.0 | 10.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1741755 | 2006 | 4 | 14 | 5 | 1519.0 | 1520 | 1622.0 | 1616 | US | 397 | ... | 7.0 | 11.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 405706 | 2006 | 1 | 20 | 5 | 1412.0 | 1408 | 1526.0 | 1526 | NW | 768 | ... | 6.0 | 17.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
50 rows × 29 columns
flights.describe()
| Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | ... | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5.999994e+06 | 5.999994e+06 | 5.999994e+06 | 5.999994e+06 | 5.850942e+06 | 5.999994e+06 | 5.837794e+06 | 5.999994e+06 | 5.999994e+06 | 5.837794e+06 | ... | 5.999994e+06 | 5.938579e+06 | 5.943484e+06 | 5.999994e+06 | 5.999994e+06 | 4.499649e+06 | 4.499649e+06 | 4.499649e+06 | 4.499649e+06 | 4.499649e+06 |
| mean | 2.007000e+03 | 2.230244e+00 | 1.563305e+01 | 3.924221e+00 | 1.342737e+03 | 1.332761e+03 | 1.489363e+03 | 1.497670e+03 | 2.280889e+03 | 1.250645e+02 | ... | 7.082536e+02 | 6.644308e+00 | 1.577898e+01 | 2.484202e-02 | 2.191336e-03 | 4.834934e+00 | 9.635978e-01 | 4.811443e+00 | 3.183337e-02 | 6.473191e+00 |
| std | 8.164966e-01 | 1.000848e+00 | 8.712456e+00 | 1.987410e+00 | 4.769305e+02 | 4.633683e+02 | 5.020008e+02 | 4.804109e+02 | 2.016044e+03 | 6.970250e+01 | ... | 5.514096e+02 | 1.755335e+01 | 1.080802e+01 | 1.556435e-01 | 4.676039e-02 | 2.244798e+01 | 1.076476e+01 | 1.769023e+01 | 1.258402e+00 | 2.349534e+01 |
| min | 2.006000e+03 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+01 | ... | 1.100000e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 2.006000e+03 | 1.000000e+00 | 8.000000e+00 | 2.000000e+00 | 9.330000e+02 | 9.300000e+02 | 1.113000e+03 | 1.118000e+03 | 6.250000e+02 | 7.500000e+01 | ... | 3.160000e+02 | 4.000000e+00 | 1.000000e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 2.007000e+03 | 2.000000e+00 | 1.600000e+01 | 4.000000e+00 | 1.333000e+03 | 1.325000e+03 | 1.518000e+03 | 1.520000e+03 | 1.591000e+03 | 1.070000e+02 | ... | 5.520000e+02 | 5.000000e+00 | 1.300000e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 75% | 2.008000e+03 | 3.000000e+00 | 2.300000e+01 | 6.000000e+00 | 1.734000e+03 | 1.720000e+03 | 1.913000e+03 | 1.910000e+03 | 3.679000e+03 | 1.560000e+02 | ... | 9.380000e+02 | 8.000000e+00 | 1.900000e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| max | 2.008000e+03 | 4.000000e+00 | 3.100000e+01 | 7.000000e+00 | 2.755000e+03 | 2.359000e+03 | 2.923000e+03 | 2.400000e+03 | 9.740000e+03 | 1.760000e+03 | ... | 4.962000e+03 | 1.470000e+03 | 6.020000e+02 | 1.000000e+00 | 1.000000e+00 | 1.490000e+03 | 1.148000e+03 | 1.374000e+03 | 3.820000e+02 | 1.254000e+03 |
8 rows × 24 columns
flights.columns
Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime',
'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'TailNum',
'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay',
'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut',
'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay',
'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
dtype='object')
# 1-D exploration of hidden colomns.
hidden_columns =['TailNum', 'ActualElapsedTime', 'CRSElapsedTime',
'AirTime', 'ArrDelay','DepDelay', 'Origin', 'Dest', 'Distance']
flights[hidden_columns].head(10)
| TailNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | N657AW | 281.0 | 273.0 | 223.0 | 6.0 | -2.0 | ATL | PHX | 1587 |
| 1 | N834AW | 260.0 | 265.0 | 214.0 | -5.0 | 0.0 | ATL | PHX | 1587 |
| 2 | N605AW | 235.0 | 258.0 | 220.0 | -23.0 | 0.0 | ATL | PHX | 1587 |
| 3 | N312AW | 152.0 | 158.0 | 126.0 | -8.0 | -2.0 | AUS | PHX | 872 |
| 4 | N309AW | 171.0 | 163.0 | 132.0 | 0.0 | -8.0 | AUS | PHX | 872 |
| 5 | N733UW | 127.0 | 122.0 | 108.0 | 2.0 | -3.0 | BDL | CLT | 644 |
| 6 | N177UW | 136.0 | 121.0 | 111.0 | 20.0 | 5.0 | BDL | CLT | 644 |
| 7 | N404US | 133.0 | 123.0 | 121.0 | 7.0 | -3.0 | BDL | CLT | 644 |
| 8 | N722UW | 119.0 | 120.0 | 103.0 | -7.0 | -6.0 | BDL | CLT | 644 |
| 9 | N104UW | 119.0 | 120.0 | 105.0 | -4.0 | -3.0 | BDL | CLT | 644 |
flights[hidden_columns].describe()
| ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | |
|---|---|---|---|---|---|---|
| count | 5.837794e+06 | 5.999543e+06 | 5.837794e+06 | 5.837794e+06 | 5.850942e+06 | 5.999994e+06 |
| mean | 1.250645e+02 | 1.263267e+02 | 1.022940e+02 | 9.370251e+00 | 1.091705e+01 | 7.082536e+02 |
| std | 6.970250e+01 | 6.899270e+01 | 6.958875e+01 | 3.775493e+01 | 3.469922e+01 | 5.514096e+02 |
| min | 1.000000e+01 | -1.240000e+03 | -1.419000e+03 | -5.750000e+02 | -1.197000e+03 | 1.100000e+01 |
| 25% | 7.500000e+01 | 7.600000e+01 | 5.400000e+01 | -9.000000e+00 | -4.000000e+00 | 3.160000e+02 |
| 50% | 1.070000e+02 | 1.080000e+02 | 8.400000e+01 | -1.000000e+00 | 0.000000e+00 | 5.520000e+02 |
| 75% | 1.560000e+02 | 1.560000e+02 | 1.310000e+02 | 1.400000e+01 | 1.000000e+01 | 9.380000e+02 |
| max | 1.760000e+03 | 1.435000e+03 | 1.710000e+03 | 2.461000e+03 | 2.457000e+03 | 4.962000e+03 |
As observed, the stated delay time for some samples on the hidden columns are in negative. These points should be looked at closely for a decision (modification or filtering) to be reached.
flights.CancellationCode.value_counts()
B 69560 A 54060 C 25304 D 128 Name: CancellationCode, dtype: int64
flights.UniqueCarrier.value_counts()
WN 1118872 OO 508367 AA 473161 XE 414659 US 399459 UA 397127 MQ 389270 DL 361303 NW 301192 YV 284542 OH 231356 CO 230282 EV 204368 FL 181396 B6 131005 9E 129657 AS 112568 F9 66054 HA 39280 AQ 19172 TZ 6904 Name: UniqueCarrier, dtype: int64
# check for samples where the local time (HH:MM) are out of the expected 24 hours range.
print(flights.ArrTime.min(),flights.ArrTime.max())
print(flights.CRSArrTime.min(),flights.CRSArrTime.max())
print(flights.DepTime.min(),flights.DepTime.max())
print(flights.CRSDepTime.min(),flights.CRSDepTime.max())
1.0 2923.0 1 2400 1.0 2755.0 0 2359
Departure time looks okay, however, the other variables are outside the possible 24 hours per day range. This must be reviewed (modified or dropped).
# Exploring the other DataFrames
print('carriers shape: ',carriers.shape)
carriers.sample(10)
carriers shape: (1491, 2)
| Code | Description | |
|---|---|---|
| 367 | CLB | Columbia Pacific Airlines |
| 1423 | XAU | Flight Line Inc. |
| 142 | AEX | Air Excursions LLC |
| 1419 | XAQ | Cam Air International Inc. |
| 1085 | RL | UltrAir |
| 969 | PAI | Permian Airways Inc. |
| 905 | NKQ | Norcanair |
| 979 | PD (1) | Pem-Air Limited |
| 374 | CMQ | Servicio De Carga Aerea Sa |
| 1392 | WP | Island Air Hawaii |
print('airports shape: ',airports.shape)
airports.sample(10)
airports shape: (3376, 7)
| iata | airport | city | state | country | lat | long | |
|---|---|---|---|---|---|---|---|
| 3199 | UGN | Waukegan Regional | Chicago/Waukegan | IL | USA | 42.422160 | -87.867907 |
| 2516 | ONP | Newport Muni | Newport | OR | USA | 44.580361 | -124.057917 |
| 1838 | IAN | Bob Baker Memorial | Kiana | AK | USA | 66.979376 | -160.435860 |
| 748 | A29 | Sitka SPB | Sitka | AK | USA | 57.052138 | -135.346209 |
| 3314 | X07 | Lake Wales Municipal | Lake Wales | FL | USA | 27.893806 | -81.620389 |
| 3104 | TIX | Space Cost Regional | Titusville | FL | USA | 28.514799 | -80.799228 |
| 1209 | CWI | Clinton Municipal | Clinton | IA | USA | 41.831113 | -90.329131 |
| 1897 | IWK | Wales | Wales | AK | USA | 65.623940 | -168.099172 |
| 2741 | Q95 | Ruth | Ruth | CA | USA | 40.211259 | -123.297523 |
| 2669 | PRN | Greenville Municipal | Greenville | AL | USA | 31.845402 | -86.610446 |
print('plane_data shape: ',plane_data.shape)
plane_data.sample(10)
plane_data shape: (5029, 9)
| tailnum | type | manufacturer | issue_date | model | status | aircraft_type | engine_type | year | |
|---|---|---|---|---|---|---|---|---|---|
| 3616 | N755US | Corporation | AIRBUS INDUSTRIE | 11/16/2000 | A319-112 | Valid | Fixed Wing Multi-Engine | Turbo-Jet | 2000 |
| 3893 | N7BHAA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1316 | N363NB | Corporation | AIRBUS | 06/20/2003 | A319-114 | Valid | Fixed Wing Multi-Engine | Turbo-Fan | 2003 |
| 820 | N26226 | Corporation | BOEING | 12/23/1998 | 737-824 | Valid | Fixed Wing Multi-Engine | Turbo-Fan | 1998 |
| 3367 | N705AS | Corporation | BOEING | 07/08/1998 | 737-490 | Valid | Fixed Wing Multi-Engine | Turbo-Jet | 1998 |
| 4721 | N9337 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3552 | N741UW | Corporation | AIRBUS INDUSTRIE | 08/11/2000 | A319-112 | Valid | Fixed Wing Multi-Engine | Turbo-Jet | 2000 |
| 4559 | N91629E | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3870 | N7AGAA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3602 | N753SW | Corporation | BOEING | 12/03/1999 | 737-7H4 | Valid | Fixed Wing Multi-Engine | Turbo-Fan | 1999 |
plane_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5029 entries, 0 to 5028 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tailnum 5029 non-null object 1 type 4480 non-null object 2 manufacturer 4480 non-null object 3 issue_date 4480 non-null object 4 model 4480 non-null object 5 status 4480 non-null object 6 aircraft_type 4480 non-null object 7 engine_type 4480 non-null object 8 year 4480 non-null object dtypes: object(9) memory usage: 353.7+ KB
# checking columns be to utilize for merging the different DataFrames.
flights.Origin.value_counts(), airports.iata.value_counts(), carriers.Code.value_counts()
(ATL 300773
ORD 295966
DFW 218075
DEN 193324
LAX 191008
...
HKY 42
EAU 36
RHI 34
SOP 13
HVN 3
Name: Origin, Length: 301, dtype: int64,
00M 1
MKT 1
MHL 1
MHM 1
MHP 1
..
CKF 1
CKI 1
CKM 1
CKN 1
ZZV 1
Name: iata, Length: 3376, dtype: int64,
02Q 1
PHL 1
PLA 1
PL 1
PKQ 1
..
ENT 1
EMP 1
EME 1
EMA 1
ZYZ 1
Name: Code, Length: 1490, dtype: int64)
print(flights.TailNum.value_counts(), plane_data.tailnum.value_counts())
0 59529
000000 6365
N308SW 3477
N485HA 3058
N234SW 3026
...
N194JB 1
N702DN 1
N37018 1
N853NW 1
N74007 1
Name: TailNum, Length: 5855, dtype: int64 N050AA 1
N702TW 1
N703MR 1
N703JB 1
N703AS 1
..
N410UA 1
N410SW 1
N410AA 1
N409WN 1
N999DN 1
Name: tailnum, Length: 5029, dtype: int64
# Confirming that iata codes in airports DataFrame
# is same as Origin and Destination columns in Flights.
check = airports.iata.values
flights[['Origin','Dest']].isin(check).sum()
Origin 5999994 Dest 5999994 dtype: int64
# Confirming that code column in Carriers DataFrame
# is same as UniqueCarrier columns in Flights.
check = carriers['Code'].values
flights['UniqueCarrier'].isin(check).sum()
5999994
# Confirming that tailnum column in plane-data DataFrame
# is same as tailnum column in Flights.
check = plane_data['tailnum'].values
flights['TailNum'].isin(check).sum()
5232742
We can merge the different DataFrames with the identified columns
We can merge the different DataFrames with the identified columns;
# create a copy of the flight data before wrangling process.
flight_data = flights.copy()
flight_data.shape
(5999994, 29)
# merging the DataFrames.
flight_data = flight_data.merge(right= airports, how= 'inner', left_on= 'Origin', right_on= 'iata')
flight_data = flight_data.merge(right= carriers, how= 'inner', left_on= 'UniqueCarrier', right_on= 'Code')
flight_data = flight_data.merge(right= plane_data, how= 'inner', left_on= 'TailNum', right_on= 'tailnum')
flight_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5232742 entries, 0 to 5232741 Data columns (total 47 columns): # Column Dtype --- ------ ----- 0 Year int64 1 Month int64 2 DayofMonth int64 3 DayOfWeek int64 4 DepTime float64 5 CRSDepTime int64 6 ArrTime float64 7 CRSArrTime int64 8 UniqueCarrier object 9 FlightNum int64 10 TailNum object 11 ActualElapsedTime float64 12 CRSElapsedTime float64 13 AirTime float64 14 ArrDelay float64 15 DepDelay float64 16 Origin object 17 Dest object 18 Distance int64 19 TaxiIn float64 20 TaxiOut float64 21 Cancelled int64 22 CancellationCode object 23 Diverted int64 24 CarrierDelay float64 25 WeatherDelay float64 26 NASDelay float64 27 SecurityDelay float64 28 LateAircraftDelay float64 29 iata object 30 airport object 31 city object 32 state object 33 country object 34 lat float64 35 long float64 36 Code object 37 Description object 38 tailnum object 39 type object 40 manufacturer object 41 issue_date object 42 model object 43 status object 44 aircraft_type object 45 engine_type object 46 year object dtypes: float64(16), int64(10), object(21) memory usage: 1.9+ GB
flight_data.columns
Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime',
'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'TailNum',
'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay',
'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut',
'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay',
'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay',
'iata', 'airport', 'city', 'state', 'country', 'lat', 'long', 'Code',
'Description', 'tailnum', 'type', 'manufacturer', 'issue_date', 'model',
'status', 'aircraft_type', 'engine_type', 'year'],
dtype='object')
We will not include all of the columns, as only a few are required for our analysis
# Filtering the Flight DataFrame for only relevant columns
required_columns = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'CRSDepTime','UniqueCarrier',
'Origin', 'Dest', 'Distance', 'Cancelled', 'CancellationCode', 'Diverted',
'airport', 'city', 'state','Description', 'type', 'manufacturer',
'aircraft_type', 'engine_type', 'year']
flight_data = flight_data[required_columns]
# Checking the flight Dataframe for changes.
flight_data.sample(20)
| Year | Month | DayofMonth | DayOfWeek | CRSDepTime | UniqueCarrier | Origin | Dest | Distance | Cancelled | ... | Diverted | airport | city | state | Description | type | manufacturer | aircraft_type | engine_type | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1516486 | 2006 | 4 | 25 | 2 | 625 | UA | LAX | ORD | 1745 | 0 | ... | 0 | Los Angeles International | Los Angeles | CA | United Air Lines Inc. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 2000 |
| 4605329 | 2008 | 3 | 20 | 4 | 700 | WN | SLC | DEN | 391 | 0 | ... | 0 | Salt Lake City Intl | Salt Lake City | UT | Southwest Airlines Co. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 2004 |
| 3921183 | 2006 | 3 | 1 | 3 | 1315 | OO | SLC | SAT | 1086 | 0 | ... | 0 | Salt Lake City Intl | Salt Lake City | UT | Skywest Airlines Inc. | Corporation | CANADAIR | Fixed Wing Multi-Engine | Turbo-Jet | 1995 |
| 4422176 | 2007 | 3 | 21 | 3 | 1455 | WN | RNO | LAS | 345 | 0 | ... | 0 | Reno/Tahoe International | Reno | NV | Southwest Airlines Co. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Jet | 1985 |
| 3717508 | 2007 | 2 | 8 | 4 | 1834 | OO | CID | DEN | 692 | 0 | ... | 0 | Eastern Iowa | Cedar Rapids | IA | Skywest Airlines Inc. | Corporation | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2004 |
| 3518461 | 2008 | 3 | 21 | 5 | 1155 | CO | SAT | IAH | 191 | 0 | ... | 0 | San Antonio International | San Antonio | TX | Continental Air Lines Inc. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Jet | 1985 |
| 4542406 | 2008 | 3 | 19 | 3 | 825 | WN | ELP | LAX | 714 | 0 | ... | 0 | El Paso International | El Paso | TX | Southwest Airlines Co. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 2006 |
| 595107 | 2007 | 1 | 21 | 7 | 1115 | OH | RDU | LGA | 431 | 0 | ... | 0 | Raleigh-Durham International | Raleigh | NC | Comair Inc. | Corporation | CANADAIR | Fixed Wing Multi-Engine | Turbo-Jet | 1993 |
| 4426053 | 2008 | 3 | 4 | 2 | 1850 | WN | LAS | PHX | 256 | 0 | ... | 0 | McCarran International | Las Vegas | NV | Southwest Airlines Co. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Jet | 1985 |
| 2234693 | 2006 | 1 | 10 | 2 | 1842 | FL | LGA | PHF | 288 | 0 | ... | 0 | LaGuardia | New York | NY | AirTran Airways Corporation | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 2000 |
| 36177 | 2008 | 1 | 24 | 4 | 940 | US | LAS | DFW | 1055 | 0 | ... | 0 | McCarran International | Las Vegas | NV | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 |
| 4109404 | 2006 | 3 | 16 | 4 | 600 | OO | SBA | LAX | 89 | 0 | ... | 0 | Santa Barbara Municipal | Santa Barbara | CA | Skywest Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Prop | 0000 |
| 3919549 | 2007 | 3 | 31 | 6 | 2020 | OO | PHX | LAX | 370 | 0 | ... | 0 | Phoenix Sky Harbor International | Phoenix | AZ | Skywest Airlines Inc. | Corporation | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2004 |
| 2358937 | 2007 | 2 | 16 | 5 | 740 | FL | MDW | MIA | 1182 | 0 | ... | 0 | Chicago Midway | Chicago | IL | AirTran Airways Corporation | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 2006 |
| 3418836 | 2006 | 1 | 20 | 5 | 728 | AA | MCO | STL | 880 | 0 | ... | 0 | Orlando International | Orlando | FL | American Airlines Inc. | NaN | NaN | NaN | NaN | NaN |
| 4937455 | 2008 | 4 | 14 | 1 | 810 | WN | MDW | LAS | 1521 | 0 | ... | 0 | Chicago Midway | Chicago | IL | Southwest Airlines Co. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 2007 |
| 3178502 | 2007 | 3 | 13 | 2 | 700 | AA | BUR | DFW | 1231 | 0 | ... | 0 | Burbank-Glendale-Pasadena | Burbank | CA | American Airlines Inc. | Corporation | MCDONNELL DOUGLAS | Fixed Wing Multi-Engine | Turbo-Fan | 1988 |
| 2770346 | 2008 | 3 | 20 | 4 | 1913 | NW | DTW | MKE | 238 | 0 | ... | 0 | Detroit Metropolitan-Wayne County | Detroit | MI | Northwest Airlines Inc. | Corporation | MCDONNELL DOUGLAS | Fixed Wing Multi-Engine | Turbo-Jet | 1975 |
| 690222 | 2006 | 1 | 8 | 7 | 2015 | OH | CVG | SAV | 515 | 0 | ... | 0 | Cincinnati Northern Kentucky Intl | Covington | KY | Comair Inc. | Corporation | CANADAIR | Fixed Wing Multi-Engine | Turbo-Jet | 1996 |
| 2713274 | 2006 | 2 | 17 | 5 | 1146 | MQ | SJT | DFW | 228 | 0 | ... | 0 | San Angelo Regional /Mathis | San Angelo | TX | American Eagle Airlines Inc. | Corporation | SAAB-SCANIA | Fixed Wing Multi-Engine | Turbo-Prop | None |
20 rows × 21 columns
# confirming that the UniqueCarrier column matches with the Description column
# This provides information on the Airlines.
flight_data.query('UniqueCarrier == "OO"')
| Year | Month | DayofMonth | DayOfWeek | CRSDepTime | UniqueCarrier | Origin | Dest | Distance | Cancelled | ... | Diverted | airport | city | state | Description | type | manufacturer | aircraft_type | engine_type | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 607769 | 2007 | 3 | 29 | 4 | 730 | OO | AUS | SLC | 1085 | 0 | ... | 0 | Austin-Bergstrom International | Austin | TX | Skywest Airlines Inc. | Corporation | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2004 |
| 607770 | 2008 | 1 | 1 | 2 | 1800 | OO | AUS | SLC | 1085 | 0 | ... | 0 | Austin-Bergstrom International | Austin | TX | Skywest Airlines Inc. | Corporation | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2004 |
| 607771 | 2008 | 1 | 16 | 3 | 800 | OO | AUS | SLC | 1085 | 0 | ... | 0 | Austin-Bergstrom International | Austin | TX | Skywest Airlines Inc. | Corporation | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2004 |
| 607772 | 2008 | 1 | 17 | 4 | 1755 | OO | AUS | SLC | 1085 | 0 | ... | 0 | Austin-Bergstrom International | Austin | TX | Skywest Airlines Inc. | Corporation | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2004 |
| 607773 | 2008 | 1 | 27 | 7 | 1755 | OO | AUS | SLC | 1085 | 0 | ... | 0 | Austin-Bergstrom International | Austin | TX | Skywest Airlines Inc. | Corporation | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2004 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4156737 | 2008 | 4 | 19 | 6 | 944 | OO | MOD | LAX | 292 | 0 | ... | 0 | Modesto City-County-Harry Sham | Modesto | CA | Skywest Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Prop | 1996 |
| 4156738 | 2008 | 4 | 27 | 7 | 1703 | OO | MOD | SFO | 78 | 0 | ... | 0 | Modesto City-County-Harry Sham | Modesto | CA | Skywest Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Prop | 1996 |
| 4156739 | 2008 | 4 | 28 | 1 | 1703 | OO | MOD | SFO | 78 | 0 | ... | 0 | Modesto City-County-Harry Sham | Modesto | CA | Skywest Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Prop | 1996 |
| 4156740 | 2008 | 4 | 2 | 3 | 944 | OO | MOD | LAX | 292 | 0 | ... | 0 | Modesto City-County-Harry Sham | Modesto | CA | Skywest Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Prop | 1996 |
| 4156741 | 2008 | 4 | 14 | 1 | 921 | OO | MOD | SFO | 78 | 0 | ... | 0 | Modesto City-County-Harry Sham | Modesto | CA | Skywest Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Prop | 1996 |
495430 rows × 21 columns
# confirming that the Origin column matches with the airport or city column
# This provides information on the Airport (takeoff).
flight_data.query('Origin == "SFO"')
| Year | Month | DayofMonth | DayOfWeek | CRSDepTime | UniqueCarrier | Origin | Dest | Distance | Cancelled | ... | Diverted | airport | city | state | Description | type | manufacturer | aircraft_type | engine_type | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1025 | 2006 | 1 | 28 | 6 | 900 | US | SFO | PHX | 651 | 0 | ... | 0 | San Francisco International | San Francisco | CA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 |
| 1026 | 2006 | 2 | 16 | 4 | 600 | US | SFO | PHX | 651 | 0 | ... | 0 | San Francisco International | San Francisco | CA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 |
| 1027 | 2006 | 4 | 4 | 2 | 600 | US | SFO | PHX | 651 | 0 | ... | 0 | San Francisco International | San Francisco | CA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 |
| 1028 | 2006 | 4 | 28 | 5 | 700 | US | SFO | LAS | 414 | 0 | ... | 0 | San Francisco International | San Francisco | CA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 |
| 1029 | 2007 | 2 | 8 | 4 | 938 | US | SFO | LAS | 414 | 0 | ... | 0 | San Francisco International | San Francisco | CA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5215512 | 2008 | 3 | 8 | 6 | 830 | HA | SFO | HNL | 2398 | 0 | ... | 0 | San Francisco International | San Francisco | CA | Hawaiian Airlines Inc. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 1986 |
| 5215513 | 2008 | 3 | 11 | 2 | 850 | HA | SFO | HNL | 2398 | 0 | ... | 0 | San Francisco International | San Francisco | CA | Hawaiian Airlines Inc. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 1986 |
| 5215514 | 2008 | 3 | 17 | 1 | 850 | HA | SFO | HNL | 2398 | 0 | ... | 0 | San Francisco International | San Francisco | CA | Hawaiian Airlines Inc. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 1986 |
| 5215515 | 2008 | 3 | 18 | 2 | 850 | HA | SFO | HNL | 2398 | 0 | ... | 0 | San Francisco International | San Francisco | CA | Hawaiian Airlines Inc. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 1986 |
| 5215516 | 2008 | 3 | 29 | 6 | 850 | HA | SFO | HNL | 2398 | 0 | ... | 0 | San Francisco International | San Francisco | CA | Hawaiian Airlines Inc. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 1986 |
105057 rows × 21 columns
We can drop the Origin and UniqueCarrier columns.
flight_data.drop(columns = ['UniqueCarrier','Origin'], axis = 1, inplace= True)
flight_data.sample(10)
| Year | Month | DayofMonth | DayOfWeek | CRSDepTime | Dest | Distance | Cancelled | CancellationCode | Diverted | airport | city | state | Description | type | manufacturer | aircraft_type | engine_type | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2625572 | 2008 | 1 | 14 | 1 | 1830 | ORD | 174 | 0 | NaN | 0 | Austin Straubel International | Green Bay | WI | American Eagle Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Fan | 2005 |
| 3674127 | 2008 | 1 | 22 | 2 | 2150 | LAX | 2556 | 0 | NaN | 0 | Honolulu International | Honolulu | HI | Continental Air Lines Inc. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Jet | 2002 |
| 992353 | 2007 | 3 | 18 | 7 | 1635 | EWR | 284 | 0 | NaN | 0 | Norfolk International | Norfolk | VA | Expressjet Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Fan | 2000 |
| 169484 | 2008 | 3 | 6 | 4 | 945 | PIT | 366 | 0 | NaN | 0 | Charlotte/Douglas International | Charlotte | NC | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 2000 |
| 4466457 | 2006 | 2 | 14 | 2 | 1645 | CLE | 314 | 0 | NaN | 0 | Baltimore-Washington International | Baltimore | MD | Southwest Airlines Co. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 1999 |
| 2647368 | 2008 | 3 | 13 | 4 | 1135 | ORD | 137 | 0 | NaN | 0 | Kent County International | Grand Rapids | MI | American Eagle Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Fan | 2005 |
| 2057688 | 2006 | 3 | 26 | 7 | 1055 | ATL | 781 | 0 | NaN | 0 | Wichita Mid-Continent | Wichita | KS | Atlantic Southeast Airlines | Corporation | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2002 |
| 1395271 | 2007 | 1 | 19 | 5 | 1955 | MCI | 533 | 0 | NaN | 0 | Denver Intl | Denver | CO | United Air Lines Inc. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 1987 |
| 3392033 | 2006 | 2 | 3 | 5 | 706 | MIA | 1121 | 0 | NaN | 0 | Dallas-Fort Worth International | Dallas-Fort Worth | TX | American Airlines Inc. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 1990 |
| 217715 | 2006 | 3 | 24 | 5 | 1250 | DEN | 602 | 0 | NaN | 0 | Phoenix Sky Harbor International | Phoenix | AZ | US Airways Inc. (Merged with America West 9/05... | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Jet | 1988 |
# Creating a date column
flight_data['date'] = pd.to_datetime(dict(year=flight_data.Year, month=flight_data.Month, day=flight_data.DayofMonth))
flight_data.date.value_counts()
2007-03-09 19989
2007-03-23 19962
2008-03-14 19959
2008-02-15 19958
2007-03-19 19919
...
2007-04-28 4807
2007-04-21 4783
2008-04-02 3639
2008-04-01 3574
2007-04-01 3099
Name: date, Length: 361, dtype: int64
# Creating a column for the day in the week
flight_data['day'] = flight_data.date.dt.day_name()
flight_data.day.value_counts()
Friday 780903 Monday 780280 Thursday 779634 Wednesday 766284 Tuesday 756671 Sunday 725057 Saturday 643913 Name: day, dtype: int64
flight_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5232742 entries, 0 to 5232741 Data columns (total 21 columns): # Column Dtype --- ------ ----- 0 Year int64 1 Month int64 2 DayofMonth int64 3 DayOfWeek int64 4 CRSDepTime int64 5 Dest object 6 Distance int64 7 Cancelled int64 8 CancellationCode object 9 Diverted int64 10 airport object 11 city object 12 state object 13 Description object 14 type object 15 manufacturer object 16 aircraft_type object 17 engine_type object 18 year object 19 date datetime64[ns] 20 day object dtypes: datetime64[ns](1), int64(8), object(12) memory usage: 878.3+ MB
# resetting the month column to hold the name of month
flight_data['Month'] = flight_data.date.dt.month_name()
print(flight_data.Month.value_counts())
flight_data.Month.info()
March 1662807 January 1540337 February 1451360 April 578238 Name: Month, dtype: int64 <class 'pandas.core.series.Series'> Int64Index: 5232742 entries, 0 to 5232741 Series name: Month Non-Null Count Dtype -------------- ----- 5232742 non-null object dtypes: object(1) memory usage: 79.8+ MB
# modify the year column
flight_data['Year'] = flight_data.date.dt.year
flight_data.Year.value_counts()
2008 1901242 2007 1747324 2006 1584176 Name: Year, dtype: int64
# check for NaNs in the Time column.
flight_data.CRSDepTime.isna().sum()
0
# converting the column to type string
flight_data.CRSDepTime = flight_data.CRSDepTime.astype(int).astype(str)
# set the DepTime column to four characters.
flight_data.CRSDepTime = flight_data.CRSDepTime.apply(lambda x: x.zfill(4))
flight_data.CRSDepTime.min(), flight_data.CRSDepTime.max()
('0000', '2359')
# checking for the change.
flight_data.CRSDepTime.sort_values().values[0:10]
array(['0000', '0000', '0001', '0001', '0001', '0001', '0002', '0002',
'0005', '0005'], dtype=object)
# Setting the Time columns to HH:MM format
flight_data.CRSDepTime = flight_data.CRSDepTime.str[:2] + ':' + flight_data.CRSDepTime.str[2:]
# confirming the changes
flight_data.CRSDepTime.values
array(['07:45', '17:05', '07:45', ..., '20:25', '20:25', '20:25'],
dtype=object)
# Add a column for Scheduled departure hour.
flight_data['CRSDepHour'] = flight_data.CRSDepTime.str[:2]
flight_data['CRSDepHour'].unique
<bound method Series.unique of 0 07
1 17
2 07
3 17
4 07
..
5232737 20
5232738 20
5232739 20
5232740 20
5232741 20
Name: CRSDepHour, Length: 5232742, dtype: object>
flight_data.head(10)
| Year | Month | DayofMonth | DayOfWeek | CRSDepTime | Dest | Distance | Cancelled | CancellationCode | Diverted | ... | state | Description | type | manufacturer | aircraft_type | engine_type | year | date | day | CRSDepHour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | January | 11 | 3 | 07:45 | PHX | 1587 | 0 | NaN | 0 | ... | GA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-11 | Wednesday | 07 |
| 1 | 2006 | January | 18 | 3 | 17:05 | PHX | 1587 | 0 | NaN | 0 | ... | GA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-18 | Wednesday | 17 |
| 2 | 2006 | January | 25 | 3 | 07:45 | PHX | 1587 | 0 | NaN | 0 | ... | GA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-25 | Wednesday | 07 |
| 3 | 2006 | January | 27 | 5 | 17:05 | PHX | 1587 | 0 | NaN | 0 | ... | GA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-27 | Friday | 17 |
| 4 | 2006 | January | 4 | 3 | 07:55 | PHX | 1587 | 0 | NaN | 0 | ... | GA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-04 | Wednesday | 07 |
| 5 | 2006 | March | 1 | 3 | 19:15 | PHX | 1587 | 0 | NaN | 0 | ... | GA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-03-01 | Wednesday | 19 |
| 6 | 2006 | April | 15 | 6 | 19:15 | PHX | 1587 | 0 | NaN | 0 | ... | GA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-04-15 | Saturday | 19 |
| 7 | 2006 | April | 21 | 5 | 07:55 | PHX | 1587 | 0 | NaN | 0 | ... | GA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-04-21 | Friday | 07 |
| 8 | 2006 | April | 24 | 1 | 17:15 | PHX | 1587 | 0 | NaN | 0 | ... | GA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-04-24 | Monday | 17 |
| 9 | 2007 | January | 19 | 5 | 17:05 | PHX | 1587 | 0 | NaN | 0 | ... | GA | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2007-01-19 | Friday | 17 |
10 rows × 22 columns
We can drop some columns that are well represented.
flight_data.drop(columns= ['DayofMonth','DayOfWeek'], axis= 1, inplace= True)
flight_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5232742 entries, 0 to 5232741 Data columns (total 20 columns): # Column Dtype --- ------ ----- 0 Year int64 1 Month object 2 CRSDepTime object 3 Dest object 4 Distance int64 5 Cancelled int64 6 CancellationCode object 7 Diverted int64 8 airport object 9 city object 10 state object 11 Description object 12 type object 13 manufacturer object 14 aircraft_type object 15 engine_type object 16 year object 17 date datetime64[ns] 18 day object 19 CRSDepHour object dtypes: datetime64[ns](1), int64(4), object(15) memory usage: 838.4+ MB
# loading the state table.
state = pd.read_csv('50 states and abbreviations.csv')
state.head(10)
| State | Abbreviation | |
|---|---|---|
| 0 | Alabama | AL |
| 1 | Alaska | AK |
| 2 | Arizona | AZ |
| 3 | Arkansas | AR |
| 4 | California | CA |
| 5 | Colorado | CO |
| 6 | Connecticut | CT |
| 7 | Delaware | DE |
| 8 | District of Columbia | DC |
| 9 | Florida | FL |
# merging both DataFrames.
flight_data = flight_data.merge(right= state, how= 'inner', left_on= 'state', right_on= 'Abbreviation')
flight_data.head()
| Year | Month | CRSDepTime | Dest | Distance | Cancelled | CancellationCode | Diverted | airport | city | ... | type | manufacturer | aircraft_type | engine_type | year | date | day | CRSDepHour | State | Abbreviation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | January | 07:45 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | ... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-11 | Wednesday | 07 | Georgia | GA |
| 1 | 2006 | January | 17:05 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | ... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-18 | Wednesday | 17 | Georgia | GA |
| 2 | 2006 | January | 07:45 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | ... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-25 | Wednesday | 07 | Georgia | GA |
| 3 | 2006 | January | 17:05 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | ... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-27 | Friday | 17 | Georgia | GA |
| 4 | 2006 | January | 07:55 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | ... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-04 | Wednesday | 07 | Georgia | GA |
5 rows × 22 columns
# drop represented columns.
flight_data.drop(columns = ['state', 'Abbreviation'], axis = 1, inplace= True)
flight_data.head()
| Year | Month | CRSDepTime | Dest | Distance | Cancelled | CancellationCode | Diverted | airport | city | Description | type | manufacturer | aircraft_type | engine_type | year | date | day | CRSDepHour | State | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | January | 07:45 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-11 | Wednesday | 07 | Georgia |
| 1 | 2006 | January | 17:05 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-18 | Wednesday | 17 | Georgia |
| 2 | 2006 | January | 07:45 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-25 | Wednesday | 07 | Georgia |
| 3 | 2006 | January | 17:05 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-27 | Friday | 17 | Georgia |
| 4 | 2006 | January | 07:55 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-04 | Wednesday | 07 | Georgia |
Before we add a column for 'destination_airport', We must rename specific columns to ease our analysis. Columns like:
# renaming the columns.
flight_data.rename(columns={'airport': 'origin_airport', 'Description': 'airline',
'year': 'airplane_issue_year' }, inplace=True)
flight_data.head()
| Year | Month | CRSDepTime | Dest | Distance | Cancelled | CancellationCode | Diverted | origin_airport | city | airline | type | manufacturer | aircraft_type | engine_type | airplane_issue_year | date | day | CRSDepHour | State | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | January | 07:45 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-11 | Wednesday | 07 | Georgia |
| 1 | 2006 | January | 17:05 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-18 | Wednesday | 17 | Georgia |
| 2 | 2006 | January | 07:45 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-25 | Wednesday | 07 | Georgia |
| 3 | 2006 | January | 17:05 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-27 | Friday | 17 | Georgia |
| 4 | 2006 | January | 07:55 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-04 | Wednesday | 07 | Georgia |
We can now add the 'destination_airport' column. We will only add relevant columns from the Airports DataFrame.
# loading the airports DataFrame for relevant columns
airports = airports[['iata', 'airport']]
airports.head()
| iata | airport | |
|---|---|---|
| 0 | 00M | Thigpen |
| 1 | 00R | Livingston Municipal |
| 2 | 00V | Meadow Lake |
| 3 | 01G | Perry-Warsaw |
| 4 | 01J | Hilliard Airpark |
# rename the airport column and merge to the flights_data Dataframe.
airports.rename(columns={'airport': 'destination_airport'}, inplace= True)
airports.head()
| iata | destination_airport | |
|---|---|---|
| 0 | 00M | Thigpen |
| 1 | 00R | Livingston Municipal |
| 2 | 00V | Meadow Lake |
| 3 | 01G | Perry-Warsaw |
| 4 | 01J | Hilliard Airpark |
# merging the DataFrames (airports and flight_data)
flight_data = flight_data.merge(right= airports, how= 'inner', left_on= 'Dest', right_on= 'iata')
flight_data.head()
| Year | Month | CRSDepTime | Dest | Distance | Cancelled | CancellationCode | Diverted | origin_airport | city | ... | manufacturer | aircraft_type | engine_type | airplane_issue_year | date | day | CRSDepHour | State | iata | destination_airport | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | January | 07:45 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | ... | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-11 | Wednesday | 07 | Georgia | PHX | Phoenix Sky Harbor International |
| 1 | 2006 | January | 17:05 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | ... | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-18 | Wednesday | 17 | Georgia | PHX | Phoenix Sky Harbor International |
| 2 | 2006 | January | 07:45 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | ... | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-25 | Wednesday | 07 | Georgia | PHX | Phoenix Sky Harbor International |
| 3 | 2006 | January | 17:05 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | ... | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-27 | Friday | 17 | Georgia | PHX | Phoenix Sky Harbor International |
| 4 | 2006 | January | 07:55 | PHX | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | ... | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-04 | Wednesday | 07 | Georgia | PHX | Phoenix Sky Harbor International |
5 rows × 22 columns
# drop unrequired columns.
flight_data.drop(columns = ['Dest', 'iata'], axis = 1, inplace= True)
flight_data.head()
| Year | Month | CRSDepTime | Distance | Cancelled | CancellationCode | Diverted | origin_airport | city | airline | type | manufacturer | aircraft_type | engine_type | airplane_issue_year | date | day | CRSDepHour | State | destination_airport | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | January | 07:45 | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-11 | Wednesday | 07 | Georgia | Phoenix Sky Harbor International |
| 1 | 2006 | January | 17:05 | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-18 | Wednesday | 17 | Georgia | Phoenix Sky Harbor International |
| 2 | 2006 | January | 07:45 | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-25 | Wednesday | 07 | Georgia | Phoenix Sky Harbor International |
| 3 | 2006 | January | 17:05 | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-27 | Friday | 17 | Georgia | Phoenix Sky Harbor International |
| 4 | 2006 | January | 07:55 | 1587 | 0 | NaN | 0 | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-04 | Wednesday | 07 | Georgia | Phoenix Sky Harbor International |
# The NaNs on the CancellationCode column should be replaced with "None"
flight_data['CancellationCode'].replace({np.NaN : "None"}, inplace=True)
flight_data['CancellationCode'].value_counts()
None 5162238 A 18181 B 17299 C 8619 D 11 Name: CancellationCode, dtype: int64
# replace the "1" and "0" in cancelled and diverted columns.
flight_data['Cancelled'].replace({1: "yes", 0: "no"}, inplace=True)
flight_data['Diverted'].replace({1: "yes", 0: "no"}, inplace=True)
flight_data.head()
| Year | Month | CRSDepTime | Distance | Cancelled | CancellationCode | Diverted | origin_airport | city | airline | type | manufacturer | aircraft_type | engine_type | airplane_issue_year | date | day | CRSDepHour | State | destination_airport | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | January | 07:45 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-11 | Wednesday | 07 | Georgia | Phoenix Sky Harbor International |
| 1 | 2006 | January | 17:05 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-18 | Wednesday | 17 | Georgia | Phoenix Sky Harbor International |
| 2 | 2006 | January | 07:45 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-25 | Wednesday | 07 | Georgia | Phoenix Sky Harbor International |
| 3 | 2006 | January | 17:05 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-27 | Friday | 17 | Georgia | Phoenix Sky Harbor International |
| 4 | 2006 | January | 07:55 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. (Merged with America West 9/05... | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-04 | Wednesday | 07 | Georgia | Phoenix Sky Harbor International |
# modify the long Airline name (US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.)), and
# the cancellationcode with the right values; A = carrier, B = weather, C = NAS, D = security
flight_data['airline'].replace({'US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.)'
: "US Airways Inc."}, inplace=True)
flight_data['CancellationCode'].replace({"A": "carrier", "B": "weather", "C": "NAS", "D": "security"}, inplace=True)
flight_data.head()
| Year | Month | CRSDepTime | Distance | Cancelled | CancellationCode | Diverted | origin_airport | city | airline | type | manufacturer | aircraft_type | engine_type | airplane_issue_year | date | day | CRSDepHour | State | destination_airport | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | January | 07:45 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-11 | Wednesday | 07 | Georgia | Phoenix Sky Harbor International |
| 1 | 2006 | January | 17:05 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-18 | Wednesday | 17 | Georgia | Phoenix Sky Harbor International |
| 2 | 2006 | January | 07:45 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-25 | Wednesday | 07 | Georgia | Phoenix Sky Harbor International |
| 3 | 2006 | January | 17:05 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-27 | Friday | 17 | Georgia | Phoenix Sky Harbor International |
| 4 | 2006 | January | 07:55 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-04 | Wednesday | 07 | Georgia | Phoenix Sky Harbor International |
# we can check for and drop duplicates
flight_data.duplicated().sum()
143
# drop duplicates
flight_data.drop_duplicates(inplace=True)
flight_data.duplicated().sum()
0
# Convert the column headers to lowercase and then explore the wrangled DataFrame.
flight_data.columns = [x.lower() for x in flight_data.columns]
flight_data.sample(10)
| year | month | crsdeptime | distance | cancelled | cancellationcode | diverted | origin_airport | city | airline | type | manufacturer | aircraft_type | engine_type | airplane_issue_year | date | day | crsdephour | state | destination_airport | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4340378 | 2007 | March | 20:55 | 479 | no | None | no | Sacramento International | Sacramento | Southwest Airlines Co. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 1988 | 2007-03-19 | Monday | 20 | California | Portland Intl |
| 2809506 | 2007 | March | 05:45 | 732 | no | None | no | Dallas-Fort Worth International | Dallas-Fort Worth | Delta Air Lines Inc. | Corporation | MCDONNELL DOUGLAS AIRCRAFT CO | Fixed Wing Multi-Engine | Turbo-Fan | 1991 | 2007-03-05 | Monday | 05 | Texas | William B Hartsfield-Atlanta Intl |
| 877224 | 2007 | March | 19:45 | 552 | no | None | no | Adams | Little Rock | American Eagle Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Fan | 2001 | 2007-03-20 | Tuesday | 19 | Arkansas | Chicago O'Hare International |
| 2967374 | 2008 | February | 07:59 | 761 | no | None | no | LaGuardia | New York | Delta Air Lines Inc. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Jet | 1991 | 2008-02-28 | Thursday | 07 | New York | William B Hartsfield-Atlanta Intl |
| 4860498 | 2007 | March | 09:08 | 448 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | Atlantic Southeast Airlines | Corporation | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2001 | 2007-03-30 | Friday | 09 | Georgia | Monroe Regional |
| 1086578 | 2008 | March | 14:05 | 986 | no | None | no | George Bush Intercontinental | Houston | Expressjet Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Fan | 1998 | 2008-03-25 | Tuesday | 14 | Texas | Piedmont Triad International |
| 3771883 | 2008 | March | 17:15 | 677 | no | None | no | Los Angeles International | Los Angeles | Southwest Airlines Co. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 2000 | 2008-03-16 | Sunday | 17 | California | Albuquerque International |
| 435576 | 2007 | March | 08:10 | 2075 | no | None | no | Phoenix Sky Harbor International | Phoenix | Southwest Airlines Co. | Corporation | BOEING | Fixed Wing Multi-Engine | Turbo-Fan | 1999 | 2007-03-29 | Thursday | 08 | Arizona | Philadelphia Intl |
| 1921376 | 2007 | February | 09:00 | 627 | no | None | no | Minneapolis-St Paul Intl | Minneapolis | Northwest Airlines Inc. | Corporation | DOUGLAS | Fixed Wing Multi-Engine | Turbo-Jet | 1967 | 2007-02-06 | Tuesday | 09 | Minnesota | Port Columbus Intl |
| 1200130 | 2007 | January | 19:45 | 631 | no | None | no | Newark Intl | Newark | Expressjet Airlines Inc. | Corporation | EMBRAER | Fixed Wing Multi-Engine | Turbo-Fan | 2004 | 2007-01-17 | Wednesday | 19 | New Jersey | McGhee-Tyson |
flight_data.describe()
| year | distance | |
|---|---|---|
| count | 5.206205e+06 | 5.206205e+06 |
| mean | 2.007061e+03 | 7.197380e+02 |
| std | 8.138602e-01 | 5.581469e+02 |
| min | 2.006000e+03 | 1.100000e+01 |
| 25% | 2.006000e+03 | 3.200000e+02 |
| 50% | 2.007000e+03 | 5.760000e+02 |
| 75% | 2.008000e+03 | 9.510000e+02 |
| max | 2.008000e+03 | 4.962000e+03 |
flight_data.shape, (flight_data.query('cancelled == "yes"')).shape
((5206205, 20), (44110, 20))
# convert day, month, and crsdephour into ordered categorical types
ordinal_var_dict = {'day': ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'],
'month': ['January', 'February', 'March', 'April'],
'crsdephour': ['01', '02', '03','04', '05', '06', '07', '08', '09', '10', '11','12'
,'13','14', '15', '16', '17', '18', '19', '20', '21','22','23','00']}
for var in ordinal_var_dict:
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
categories = ordinal_var_dict[var])
flight_data[var] = flight_data[var].astype(ordered_var)
flight_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5206205 entries, 0 to 5206347 Data columns (total 20 columns): # Column Dtype --- ------ ----- 0 year int64 1 month category 2 crsdeptime object 3 distance int64 4 cancelled object 5 cancellationcode object 6 diverted object 7 origin_airport object 8 city object 9 airline object 10 type object 11 manufacturer object 12 aircraft_type object 13 engine_type object 14 airplane_issue_year object 15 date datetime64[ns] 16 day category 17 crsdephour category 18 state object 19 destination_airport object dtypes: category(3), datetime64[ns](1), int64(2), object(14) memory usage: 729.9+ MB
# save the wrangled data for part 2.
flight_data.to_csv('flights_clean', index= False)
flights = pd.read_csv('flights_clean')
flights.head()
| year | month | crsdeptime | distance | cancelled | cancellationcode | diverted | origin_airport | city | airline | type | manufacturer | aircraft_type | engine_type | airplane_issue_year | date | day | crsdephour | state | destination_airport | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | January | 07:45 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-11 | Wednesday | 7 | Georgia | Phoenix Sky Harbor International |
| 1 | 2006 | January | 17:05 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-18 | Wednesday | 17 | Georgia | Phoenix Sky Harbor International |
| 2 | 2006 | January | 07:45 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-25 | Wednesday | 7 | Georgia | Phoenix Sky Harbor International |
| 3 | 2006 | January | 17:05 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-27 | Friday | 17 | Georgia | Phoenix Sky Harbor International |
| 4 | 2006 | January | 07:55 | 1587 | no | None | no | William B Hartsfield-Atlanta Intl | Atlanta | US Airways Inc. | Corporation | AIRBUS INDUSTRIE | Fixed Wing Multi-Engine | Turbo-Jet | 1999 | 2006-01-04 | Wednesday | 7 | Georgia | Phoenix Sky Harbor International |
There are 5.2 million US flights in the dataset with 44,110 of those flights cancelled between 2006 and 2008. The dataset has 20 variables (), most of which are categorical, with month, day, and scheduled departure hour as ordered variables with the following levels.
(LOW) ---------------------------------------------------------------------(HIGH)
month: January, February, March, April.
day: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday.
crsdephour : 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 00.
I want to investigate the variables that can help with predicting cancellation of flight.
I will pay more attention to Departure time, Day and Airlines.
I expect that airlines (carriers) and very busy states/airports would have the greatest impact on cancelled flights, although the variables above could play another role. We can also explore the possible impact of distance.
Let's start by exploring the individual variables
#let's start with the distribution of month, day and airline
fig, ax = plt.subplots(nrows=3, figsize = [8,10])
order=flight_data['airline'].value_counts().index
base_color = sb.color_palette()[0]
sb.countplot(data=flight_data, y='airline',color=base_color, order=order, ax = ax[0])
sb.countplot(data = flight_data, y = 'day', color = base_color, ax = ax[1])
sb.countplot(data = flight_data, y = 'month', color = base_color, ax = ax[2])
plt.xticks([0.2e6, 0.40e6, 0.6e6, 0.80e6, 1.00e6, 1.20e6, 1.40e6, 1.60e6, 1.80e6],
['200k', '400k', '600k', '800k', '1m', '1.2m', '1.4m', '1.6m', '1.8m']);
Next, let's explore the airport, city, and year variables.
# Exploring the year variable
sb.countplot(data = flight_data, y = 'year', color = base_color)
plt.xticks([0.25e6, 0.50e6, 0.75e6, 1.00e6, 1.25e6, 1.50e6, 1.75e6, 2.00e6],
['250k', '500k', '750k', '1m', '1.25m', '1.5m', '1.75m', '2m']);
The number of scheduled flights appear to increase with the year. 2008 had the most number of scheduled flights.
# Airport varibles
fig, ax = plt.subplots(nrows=2, figsize = [8,10])
sb.countplot(data = flight_data, y = 'origin_airport', color = base_color, ax=ax[0])
sb.countplot(data = flight_data, y = 'destination_airport', color = base_color, ax= ax[1]);
We have to work with the top 10 airports (origin and destination), so that we can reduce the noise or states on the plot.
# confirming the top 10 airports for origin and destination.
flight_data.origin_airport.value_counts()[:10],flight_data.destination_airport.value_counts()[:10]
(William B Hartsfield-Atlanta Intl 292115 Chicago O'Hare International 277287 Dallas-Fort Worth International 198511 Denver Intl 187255 George Bush Intercontinental 175345 Los Angeles International 168096 Phoenix Sky Harbor International 164357 McCarran International 131325 Newark Intl 121420 Salt Lake City Intl 116472 Name: origin_airport, dtype: int64, William B Hartsfield-Atlanta Intl 290020 Chicago O'Hare International 275291 Dallas-Fort Worth International 197537 Denver Intl 187124 George Bush Intercontinental 174640 Los Angeles International 166163 Phoenix Sky Harbor International 163960 McCarran International 131535 Newark Intl 119149 Salt Lake City Intl 116194 Name: destination_airport, dtype: int64)
The top 10 airports for both Origin and Destination flights are the same. This works well for our analysis.
# explore the airport variables.
fig, ax = plt.subplots(nrows=2, figsize = [8,10])
top_airport = list(flight_data.origin_airport.value_counts()[:10].index)
sample_origin = flight_data[flight_data['origin_airport'].isin(top_airport)]
sample_destination = flight_data[flight_data['destination_airport'].isin(top_airport)]
sb.countplot(data = sample_origin, y = 'origin_airport', color = base_color, ax=ax[0], order= top_airport)
sb.countplot(data = sample_destination, y = 'destination_airport', color = base_color, ax= ax[1], order= top_airport);
The top airports (origin and destination) show almost the same number of flights. We will see how this impacts cancelled flights. One would expect that busy Airports would tend to cancel more flights than the less busy ones.
Next, we explore the DepHour variable.
# exploring the Depature hour variable.
plt.figure(figsize=[8, 6])
sb.countplot(data=flight_data, y='crsdephour',color=base_color)
plt.ylabel('Departure Hour')
for i in range(flight_data.crsdephour.value_counts().shape[0]):
count = flight_data.crsdephour.value_counts(sort= False)[i]
pct_string = '{:0.5f}%'.format(100*count/flight_data.shape[0])
plt.text(count/count,i, pct_string, va='center',bbox=dict(facecolor='white', alpha=0.3));
7.1% flights were booked for 6am. More flights were generally booked in the morning. 5pm also saw a shoot in booked flights (6.78%).
3am appears to have the least number of scheduled flights. We will how this impacts other variables as well.
Up next, we should look at Distance
# exploring the distance variable
fig, ax = plt.subplots(nrows=2, figsize=[8, 10])
bin_size = 0.105
bins = 10 ** np.arange(np.log10(flight_data['distance'].min()),
np.log10(flight_data['distance'].max())+bin_size, bin_size)
ax[0].hist(data=flight_data, x='distance',color=base_color)
ax[0].set_xlabel('Distance (miles)');
ax[1].hist(data=flight_data, x='distance',color=base_color, bins= bins)
ax[1].set_xscale('log')
ticks = [10, 30, 100, 300, 1000, 3000, 10000]
labels = ['{}'.format(i) for i in ticks]
ax[1].set_xticks(ticks,labels)
ax[1].set_xlabel('Log of Distance (miles)');
Distance is skewed to the right, with a lot of flights around 1000 miles, and few on the high end. When plotted on a log-scale, the distance distribution looks roughly unimodal, with the highest peak around 700 miles.
Next up, the cancelled flights and cancellation code.
# we will use a pie chart for this exploration.
plt.figure(figsize = [8, 5])
fig.subplots_adjust(hspace=6)
plt.subplot(1, 2, 1)
fmt = flight_data["cancelled"].value_counts()/flight_data.cancelled.value_counts().sum()
plt.pie(x=flight_data["cancelled"].value_counts(),
labels=flight_data["cancelled"].value_counts().index, startangle=90,
autopct='%.f%%',wedgeprops = {'width' : 0.8});
plt.legend(bbox_to_anchor=(1, 1.02), loc='upper left',title="Cancelled");
plt.axis('square')
#plt.title('Flights');
plt.subplot(1, 2, 2)
cancelled_flights = flight_data[flight_data.cancellationcode != 'None']
count = cancelled_flights.cancellationcode.value_counts()
fmt = count/count.sum()
plt.pie(x=cancelled_flights.cancellationcode.value_counts(),
labels=count.index, startangle=90,
autopct='%.2f%%',wedgeprops = {'width' : 0.8});
plt.legend(bbox_to_anchor=(1, 1.02), loc='upper left',title="Reason");
plt.axis('square')
#plt.title('Cancellation Reason');
plt.suptitle('Distribution of Flights and Cancelled Flights');
This implies that 1% of flights were cancelled, and majority (41.22%) of the cancelled flights can be attributed to Airlines (Carriers) and then Weather and so on...
# let's look at state.
plt.figure(figsize=[10,7])
count = flight_data.state.value_counts().index
sb.countplot(data= flight_data, y='state', color= base_color, order= count);
# Lets look at the busiest states in terms of scheduled flights (more than 100,000 scheduled flights).
state_count = flight_data.state.value_counts()
idx = np.sum(state_count > 100000)
state_order = state_count.index[:idx]
sb.countplot(data= flight_data, y='state', order= state_order, color= base_color);
The distance variable had a lot of values at the lower end (0 - 1000 miles), so I did a log-transformation to get better insight. With the transformation, distance appeared to be roughly unimodal, with the highest peak around 700 miles..
When investigating the airports variable, a lot of airports were plotted. I adjusted the plot to reflect only the top 10 airports for both Origin and Departure. Also, for the cancelled flights, I filtered the entire data to just the flights that were cancelled.
To start off with, I want to look at the relationship among features in the data. We will attempt to address the questions below:
# What is the relationship between airports and cancelled flights?
fig, ax = plt.subplots( 2, figsize=(10, 10))
sb.countplot(data = cancelled_flights, y = 'origin_airport', hue= 'cancelled' , order= top_airport, ax= ax[0]);
sb.countplot(data = cancelled_flights, y = 'origin_airport', hue= 'cancellationcode' , order= top_airport, ax= ax[1]);
fig.suptitle('Distribution of Flights by Top 10 Airports and Reasons for Cancelled Flights', size=15);
# What are the distribution of cancelled flights by year, month and day?
fig, ax = plt.subplots(1, 3, figsize=(14, 6))
sb.countplot(data=cancelled_flights, x='year', hue='cancellationcode', ax=ax[0])
sb.countplot(data=cancelled_flights, x='month', hue='cancellationcode', ax=ax[1])
ax[1].set_ylabel("")
sb.countplot(data=cancelled_flights, x='day', hue='cancellationcode', ax=ax[2])
ax[2].tick_params(axis='x',rotation=30)
ax[2].legend(bbox_to_anchor=(1,0.5),framealpha=1)
ax[2].set_ylabel("");
fig.suptitle('Distribution of Cancelled flights by Year, Month and Day', size=15);
# Which airline has the maximum proportion of cancelled flights and what are the reasons of cancellation?
fig, ax = plt.subplots( 2, figsize=(10, 10))
order_airline = cancelled_flights.airline.value_counts().index
order_airline_2 =flight_data.airline.value_counts().index
sb.countplot(data = flight_data, y = 'airline', order= order_airline_2, label='Total', ax= ax[0], color=base_color)
sb.countplot(data = cancelled_flights, y = 'airline', order= order_airline_2, label='Cancelled' , ax= ax[0], color='red');
ax[0].set_title('A')
ax[0].legend(title="Flights");
sb.countplot(data = cancelled_flights, y = 'airline', hue= 'cancellationcode', order= order_airline , ax= ax[1]);
ax[1].set_title('B');
fig.suptitle('Distribution of Flights by Airline (A) and Airlines by Cancelled Flights (B)', size=15);
Up next, any relationship betweeen Distance and Cancelled Flights?
# compute the logarithm of distance to make plotting easier.
def log_trans(x, inverse = False):
""" quick function for computing log and power operations """
if not inverse:
return np.log10(x)
else:
return np.power(10, x)
flight_data['log_distance'] = flight_data['distance'].apply(log_trans)
cancelled_flights['log_distance'] = cancelled_flights['distance'].apply(log_trans)
# Any relationship between Cancelled Flights and Distance?
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
fig.subplots_adjust(wspace=0.8)
sb.violinplot(data=cancelled_flights, y='cancellationcode', x='log_distance', inner='quartile', ax= ax[0]);
ax[0].set_title('A')
ax[0].set_xticks(log_trans(np.array([30, 100, 300, 1e3, 3e3, 1e4])))
ax[0].set_xticklabels(['30', '100', '300', '1k', '3k', '10k'])
ax[0].set_xlabel('Log of Distance (miles)');
sb.violinplot(data=cancelled_flights, y='cancellationcode', x='distance', inner='quartile', ax= ax[1]);
ax[1].set_title('B');
ax[1].set_xlabel('Distance (miles)');
plt.suptitle('Distribution of Cancelled Flights by (A) Log of Distance and (B) Distance');
# What about Airlines and Distance?
sb.boxenplot(data = flight_data, y= 'airline', x= 'log_distance', color= base_color)
plt.xticks(log_trans(np.array(ticks)), labels);
plt.title('All Airlines by Distance');
# Completed Flights and Cancelled Flights by Distance?
fig, ax = plt.subplots(1, 2, figsize=(10, 7))
fig.subplots_adjust(wspace=0.9)
completed_flights = flight_data[flight_data.cancellationcode == 'None']
sb.boxenplot(data = completed_flights, y= 'airline', x= 'log_distance', color= base_color, ax= ax[0], order= order_airline)
ax[0].set_xticks(log_trans(np.array(ticks)), labels);
ax[0].set_title('A = Completed Flights\n Ordered by Cancelled Flights');
sb.boxenplot(data = cancelled_flights, y= 'airline', x= 'log_distance', color= 'red', ax= ax[1], order= order_airline)
ax[1].set_xticks(log_trans(np.array(ticks)), labels);
ax[1].set_title('B = Cancelled Flights\n');
plt.suptitle('Distribution of Airlines by Log of Distance');
What about State & Distance; State & Cancelled Flights; State and Airlines?
# Busiest States and Log_distance.
g = sb.FacetGrid(data = flight_data, col = 'state', col_wrap = 4,
col_order = state_order, sharex= False)
g.map(plt.hist, 'log_distance').set(xscale= 'log');
g.set_titles('{col_name}')
g.set(xticks= log_trans(np.array(ticks)), xticklabels= labels)
g.set(xlim = log_trans(np.array(80)));
# Busiest States and Log_distance using a Violinplot.
g = sb.FacetGrid(data = flight_data, col = 'state', col_wrap = 4,
col_order = state_order)
g.map(sb.violinplot, data= flight_data, x='log_distance').set(xscale= 'log');
g.set_titles('{col_name}')
g.set(xticks= log_trans(np.array(ticks)), xticklabels= labels)
g.set(xlim = log_trans(np.array(80)));
Let's look at the relationship between Busiest State & Cancelled Flights.
# State & Cancelled Flights...
plt.figure(figsize=[10,10])
sb.countplot(data=cancelled_flights, hue='cancellationcode', y='state', order= state_order);
# Let's explore States and Top Airlines
g = sb.FacetGrid(data = flight_data, col = 'state', col_wrap = 3,
col_order = state_order, sharex= False)
g.map(sb.countplot, data= flight_data, y='airline', order= order_airline_2)
g.set_titles('{col_name}');
# States and Top 6 Airlines with cancelled flights
g = sb.FacetGrid(data = cancelled_flights, col = 'state', col_wrap = 3,
col_order = state_order, sharex= False)
g.map(sb.countplot, data= cancelled_flights, y='airline', order= order_airline[:6], color='red')
g.set_titles('{col_name}');
# What about Departure hour and Day?
g = sb.FacetGrid(data=flight_data, col= 'crsdephour', col_wrap= 4 )
g.map(sb.countplot, data=flight_data, y='day')
g.set_titles('{col_name}');
# What about Departure hour and Day in terms of cancelled flights?
g = sb.FacetGrid(data=cancelled_flights, col= 'crsdephour', col_wrap= 4 )
g.map(sb.countplot, data=cancelled_flights, y='day', color='red')
g.set_titles('{col_name}');
What are the preferred times or days for flights to occur?
Are there any perfect period for the worst airline?
# Let's us explore the relationships among; day, departure hour, and airlines
g = sb.FacetGrid(data=cancelled_flights, row= 'crsdephour', col= 'day', margin_titles= True)
g.map(sb.countplot, data=cancelled_flights, y='airline', color='red', order= order_airline)
g.set_titles('{col_name}', '{row_name}');
# Let's us explore the relationships among; day, departure hour, year, and cancelled flights
g = sb.FacetGrid(data=cancelled_flights, row= 'crsdephour', col= 'day', margin_titles= True)
g.map(sb.countplot, data=cancelled_flights, y='year', color= 'red',hue= 'cancellationcode')
g.set_titles('{col_name}', '{row_name}');
# Let's us explore the relationships among; day, departure hour, and state for cancelled flights
g = sb.FacetGrid(data=cancelled_flights, row= 'crsdephour', col= 'day', margin_titles= True)
g.map(sb.countplot, data=cancelled_flights, y='state', color='red', order= state_order)
g.set_titles('{col_name}', '{row_name}');
Would it be possible to identify the best airports where Mesa Airlines Inc. operates, despite their poor record of cancelled flights?
# filttering for the Saturday scheduled flights for Mesa Airlines
cancelled_flights_sub = cancelled_flights[(cancelled_flights.airline == 'Mesa Airlines Inc.')
& (cancelled_flights.day == 'Saturday')]
cancelled_flights_sub.head()
| year | month | crsdeptime | distance | cancelled | cancellationcode | diverted | origin_airport | city | airline | ... | manufacturer | aircraft_type | engine_type | airplane_issue_year | date | day | crsdephour | state | destination_airport | log_distance | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7471 | 2008 | March | 05:34 | 1009 | yes | carrier | no | George Bush Intercontinental | Houston | Mesa Airlines Inc. | ... | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2004 | 2008-03-22 | Saturday | 05 | Texas | Phoenix Sky Harbor International | 3.003891 |
| 8164 | 2007 | March | 08:50 | 347 | yes | carrier | no | El Paso International | El Paso | Mesa Airlines Inc. | ... | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2004 | 2007-03-24 | Saturday | 08 | Texas | Phoenix Sky Harbor International | 2.540329 |
| 8470 | 2008 | February | 06:00 | 347 | yes | carrier | no | El Paso International | El Paso | Mesa Airlines Inc. | ... | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | 2004 | 2008-02-02 | Saturday | 06 | Texas | Phoenix Sky Harbor International | 2.540329 |
| 9122 | 2008 | April | 08:45 | 347 | yes | carrier | no | El Paso International | El Paso | Mesa Airlines Inc. | ... | NaN | NaN | NaN | NaN | 2008-04-19 | Saturday | 08 | Texas | Phoenix Sky Harbor International | 2.540329 |
| 9187 | 2006 | April | 06:05 | 843 | yes | carrier | no | San Antonio International | San Antonio | Mesa Airlines Inc. | ... | BOMBARDIER INC | Fixed Wing Multi-Engine | Turbo-Fan | None | 2006-04-01 | Saturday | 06 | Texas | Phoenix Sky Harbor International | 2.925828 |
5 rows × 21 columns
# exploring the best 10 airports (with the least cancelled flights)
order_sub = cancelled_flights_sub.origin_airport.value_counts().index[-10:]
g = sb.FacetGrid(data=cancelled_flights_sub, row= 'crsdephour', col= 'cancellationcode', margin_titles= True)
g.map(sb.countplot, data=cancelled_flights_sub, y='origin_airport', color='red',order= order_sub)
g.set_titles(col_template='{col_name}',row_template='{row_name}');
1. 'Rapid City Regional'
2. 'Wichita Mid-Continent'
3. 'Seattle-Tacoma Intl',
4. 'Louisville International-Standiford '
5. 'Joe Foss ',
6. 'Kent County International'
7. 'Minneapolis-St Paul Intl'
8. 'Sacramento International'
9. 'Quad City'
10. 'Daytona Beach International'.
It is necessary to explore Airlines (Southwest > American Airways > Expressjet Airlines > American Eagle > Continental Air) with 100% flight completion. We will demonstrate the relationship in terms of departure hour, day and airports
# Creating a dataframe for this analysis.
perfect_airlines = ['Southwest Airlines Co.', 'American Airlines Inc.', 'Expressjet Airlines Inc.',
'American Eagle Airlines Inc.', 'Continental Air Lines Inc.']
flight_data_perfect = flight_data[flight_data.airline.isin(perfect_airlines)]
flight_data_perfect.head()
| year | month | crsdeptime | distance | cancelled | cancellationcode | diverted | origin_airport | city | airline | ... | manufacturer | aircraft_type | engine_type | airplane_issue_year | date | day | crsdephour | state | destination_airport | log_distance | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10411 | 2006 | March | 15:55 | 1009 | no | None | no | George Bush Intercontinental | Houston | Expressjet Airlines Inc. | ... | EMBRAER | Fixed Wing Multi-Engine | Turbo-Fan | 2004 | 2006-03-12 | Sunday | 15 | Texas | Phoenix Sky Harbor International | 3.003891 |
| 10412 | 2006 | April | 15:45 | 1009 | no | None | no | George Bush Intercontinental | Houston | Expressjet Airlines Inc. | ... | EMBRAER | Fixed Wing Multi-Engine | Turbo-Fan | 2004 | 2006-04-03 | Monday | 15 | Texas | Phoenix Sky Harbor International | 3.003891 |
| 10413 | 2006 | April | 19:45 | 1009 | no | None | no | George Bush Intercontinental | Houston | Expressjet Airlines Inc. | ... | EMBRAER | Fixed Wing Multi-Engine | Turbo-Fan | 2004 | 2006-04-21 | Friday | 19 | Texas | Phoenix Sky Harbor International | 3.003891 |
| 10414 | 2007 | January | 16:05 | 1009 | no | None | no | George Bush Intercontinental | Houston | Expressjet Airlines Inc. | ... | EMBRAER | Fixed Wing Multi-Engine | Turbo-Fan | 2004 | 2007-01-19 | Friday | 16 | Texas | Phoenix Sky Harbor International | 3.003891 |
| 10415 | 2007 | January | 19:25 | 1009 | no | None | no | George Bush Intercontinental | Houston | Expressjet Airlines Inc. | ... | EMBRAER | Fixed Wing Multi-Engine | Turbo-Fan | 2004 | 2007-01-16 | Tuesday | 19 | Texas | Phoenix Sky Harbor International | 3.003891 |
5 rows × 21 columns
# Perfect Airlines by Top Airports.
g = sb.FacetGrid(data=flight_data_perfect, col= 'day', row= 'crsdephour', margin_titles= True)
g.map(sb.countplot, data=flight_data_perfect, y='origin_airport', order= top_airport)
g.set_titles(col_template='{col_name}',row_template='{row_name}');
Yes, we saw that at particular times of the day, even the worst airlines were able to complete flights.
Well, we saw that the perfect airlines had a free window between 3am - 5am daily.
1. 'Rapid City Regional'
2. 'Wichita Mid-Continent'
3. 'Seattle-Tacoma Intl',
4. 'Louisville International-Standiford '
5. 'Joe Foss ',
6. 'Kent County International'
7. 'Minneapolis-St Paul Intl'
8. 'Sacramento International'
9. 'Quad City'
10. 'Daytona Beach International'.